Social determinants of health, such as inadequate housing, lack of access to reliable transportation, and insufficient food access, have increasingly been recognized as being factors that adversely impact health outcomes.(1) In this project, with guidance from advisors John Holmes, PhD and Sameh Saleh, MD, I attempt to determine if there is an observed association between low food access and type 2 diabetes prevalence in the United States for the most recent year for which data about both were available, 2015. It is possible that there is a positive association between decreased food access and type 2 diabetes prevalence, as individuals in low food access areas may have little choice but to rely on convenience foods and fast foods for their nutritional intake, which is likely to increase the risk of developing type 2 diabetes. I will use the AHRQ Social Determinants of Health dataset for this work.
Link to final project Github repository: https://github.com/mcgreevj/BMIN503_Final_Project
Type 2 diabetes is a major health problem in the United States, with significant morbidity and mortality for the at least 33 million people who suffer with this condition. Additionally, the economic costs of type 2 diabetes in the aggregate are substantial, totaling $327 billion in 2017.(2) Tragically, type 2 diabetes is largely a preventable disease that has modifiable risk factors (weight, diet, exercise).(3) Increasingly, there has been recognition that social determinants of health can serve as barriers that prevent individuals from being able to successfully modify risk factors. For example, individuals who lack access to insurance and/or reliable transportation may not be able to seek routine medical care for education, prevention, monitoring, and treatment purposes. As noted, diet is a major factor in the development of type 2 diabetes. And while ideal diets to prevent type 2 diabetes have been advised, not every American necessarily has access to food to be able to consume these recommended diets.(4) This is the phenomenon of food insecurity. Food insecurity is formally defined by the US Department of Agriculture as: “the limited or uncertain availability of nutritionally adequate and safe foods, or limited or uncertain ability to acquire acceptable foods in socially acceptable ways.”(5) One factor that can lead to lack of access to healthy foods is the phenomenon of food deserts – areas that are low income and that have few or no supermarkets or other sources of healthy food options. As formally defined by the USDA, a food desert is: “A census tract that meets both low-income and low-access criteria including: 1. poverty rate is greater than or equal to 20 percent OR median family income does not exceed 80 percent statewide (rural/urban) or metro-area (urban) median family income; 2. at least 500 people or 33 percent of the population located more than 1 mile (urban) or 10 miles (rural) from the nearest supermarket or large grocery store.”(6) In the absence of options for healthy food, individuals residing in these communities may need to resort to convenience and fast foods for some or many of their meals. In doing so, they increase the risk of developing type 2 diabetes.(7)
This project is interdisciplinary in that it combines census tract, geographical data with health survey data. Notably the AHRQ SDOH database integrates data from multiple surveys and sources. The fields and topics that contribute to understanding of this issue are multiple: public health, clinical medicine, sociology, health disparities, economics, geography, history (in that the reasons many communities have limited access to food involves historical structural decisions, including forms of structural racism such as redlining). Moreover, the findings of this project have social justice, ethical, cultural, economic, and political implications. Based on collaboration with my mentors, I have learned the importance of familiarizing myself with data sources and data definitions, notably the Data Source Documentation guide from AHRQ (AHRQ Social Determinants of Health (SDOH) Database). This step is necessary from an exploratory standpoint to contemplate what kind of project I might do and also from a practical standpoint to understand how to properly and specifically answer my project question with the available data. I have also learned that some questions may be challenging to answer exactly as initially envisioned. For example, when considering this project at first, I had hoped to show changes over time in the association between limited food access and type 2 diabetes prevalence, perhaps including recent data from 2020 or 2021. However, after close inspection of this SDOH data set, I have realized that not all data is available for every year of interest. Thus I have needed to narrow the scope of my project a bit, given that data for limited food access and diabetes prevalence intersect in 2015, which is the most recent year that both types of data are available.
The identification of an independent association between low food access and type 2 diabetes prevalence would help support the case for policies and programs (both at the national and state levels) to improve food access in impacted census tracts. From a social justice standpoint, those with power in society (elected office holders, business and community leaders, members of the health care community including researchers) have a responsibility to seek ways to provide a known and effective preventive therapy (whether a vaccine or a grocery store) to those without such access currently, to prevent disease. There is also a national public health and cost dimension to this topic. People with diabetes suffer complications and morbidity related to diabetes, impairing their ability to support themselves, live independent lives, and contribute to a community’s economic activity and well-being. At the same time, the national cost of providing care for individuals with type 2 diabetes is borne by all Americans, in part because individuals living in low access food areas are more likely to have public (Medicaid, Medicare) insurance, if they have insurance at all, that is funded by all U.S. taxpayers. Indeed, Medicaid and Medicare combined provide health coverage to approximately 36% of Americans today and that percentage continues to rise over time.(8) “Uncompensated” care for individuals with type 2 diabetes who lack any type of insurance is also a cost borne by individual health care organizations, governments, and ultimately, taxpayers. Greater awareness of the relationships between SDOH and health outcomes, combined with a blunt accounting of the costs of inaction on this subject, will hopefully spur actions that can help mitigate SDOH and improve both individual and community health.
My project attempts a descriptive analysis, that is: adjusting for obesity, uninsured status, and household income, is there an association between food access and diabetes percentage?
For this project, I used the AHRQ Social Determinants of Health database (available at https://www.ahrq.gov/sdoh/data-analytics/sdoh-data.html). I reviewed the data source documentation document (https://www.ahrq.gov/sites/default/files/wysiwyg/sdoh/SDOH-Data-Sources-Documentation-v1-Final.pdf) to learn about potential variables for this project and their availability. Ultimately, I planned to use two main variables available in the SDOH database: CHR_PCT_DM and FARA_PCT_LA_HALF_10_POP.
CHR_PCT_DIABETES is defined by the data source document above as the “percentage of adults with diagnosed diabetes (ages 20 and over).” Data for this variable is provided by the CDC Interactive Diabetes Atlas which in turn uses BRFSS data to estimate prevalence by U.S. county. Notably, starting in 2015, BRFSS included both landline and cellphone survey data to obtain diabetes prevalence data. Previously, this data had been collected exclusively via landline surveys.
Importantly, CHR_PCT_DIABETES is provided by U.S. county.
FARA_PCT_LA_HALF_10_POP is defined as the “percentage of [healthy food] low access population measured at 1/2 mile for urban areas and 10 miles for rural areas.” For this variable, low access to healthy food is defined as being far from a supermarket, supercenter, or large grocery store, so more than 1/2 mile in urban areas and more than 10 miles in rural areas for this variable.
FARA_PCT_LA_HALF_10_POP comes from the Food Access Research Atlas (FARA), created by the Economic Research Service (ERS) at the United States Department of Agriculture. A mapping tool is offered by the ERS and is available here: https://www.ers.usda.gov/data-products/food-access-research-atlas/go-to-the-atlas/.
In contrast to CHR_PCT_DIABETES, FARA_PCT_LA_HALF_10_POP is provided at the census tract level.
For further reference, the following is the ERS methodology for calculating distance to the nearest grocery store, excerpted from the ERS definitions document available at https://www.ers.usda.gov/data-products/food-access-research-atlas/documentation/:
“First, the entire country is divided into ½-kilometer (about 1/3-mile)-square grids, and data on the population are aerially allocated to these grids (see Low-Income and Low-Foodstore-Access Census Tracts, 2015-19 in the link below). Distance to the nearest supermarket is measured for each grid cell by calculating the distance between the geographic center of the ½-kilometer square grid that contains estimates of the population (number of people and other subgroup characteristics) and the center of the grid with the nearest supermarket.”
While I had hoped to be able to include recent data for this project, the most recent year that data for both CHR_PCT_DIABETES and FARA_PCT_LA_HALF_10_POP is available is 2015.
After consulting with Drs. Holmes and Saleh about variable choices, I loaded the appropriate datasets into R.
For CHR_PCT_DIABETES, the 2015 data (by county) was available as an Excel file here: https://view.officeapps.live.com/op/view.aspx?src=https%3A%2F%2Fwww.ahrq.gov%2Fsites%2Fdefault%2Ffiles%2Fwysiwyg%2Fsdoh%2FSDOH_2015_COUNTY_1_0.xlsx&wdOrigin=BROWSELINK.
For FARA_PCT_LA_HALF_10_POP, the 2015 data (by tract) was available as an Excel file here: https://www.ahrq.gov/downloads/sdoh/sdoh_2015_tract_1_0.xlsx.
Here I am loading the package readxl that will allow me to read Excel data
library("readxl")Read in the 2015 low access data, reported by census tract
Lowaccess_raw<- read_excel("C:\\Users\\McGreevJ\\Desktop\\BMI 5030 Data Science\\BMIN503_Final_Project\\Adjusted datasets\\data.sdoh_2015_tract_1_0.xlsx")
dim(Lowaccess_raw) #74133 x 405## [1] 74133 405
Read in the 2015 diabetes prevalence data, reported by county
Diabetes_raw<- read_excel("C:\\Users\\McGreevJ\\Desktop\\BMI 5030 Data Science\\BMIN503_Final_Project\\Adjusted datasets\\data.SDOH_2015_COUNTY_1_0.xlsx")## Warning: Expecting logical in RJ1671 / R1671C478: got '46123'
## Warning: Expecting logical in RJ1763 / R1763C478: got '32510'
## Warning: Expecting logical in RK1763 / R1763C479: got '41025'
## Warning: Expecting logical in RL1763 / R1763C480: got '41037'
## Warning: Expecting logical in RJ2797 / R2797C478: got '49017'
## Warning: Expecting logical in RK2797 / R2797C479: got '49019'
## Warning: Expecting logical in RL2797 / R2797C480: got '49025'
## Warning: Expecting logical in RM2797 / R2797C481: got '49055'
## Warning: Expecting logical in RJ2842 / R2842C478: got '51760'
dim(Diabetes_raw) #3234 x 979## [1] 3234 979
Here I loaded other relevant packages and then checked for differences between the Lowaccess_raw and Diabetes_raw datasets; notably, to see if the same COUNTYFIPS codes were represented in both.
library(tidyverse)## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.0 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(cowplot)
library(dplyr)
setdiff(Lowaccess_raw$COUNTYFIPS, Diabetes_raw$COUNTYFIPS)## [1] "69085"
#COUNTYFIPS 69085 (Northern Mariana Islands) is the only difference. 69085 is represented in Lowaccess_raw but not in Diabetes_raw.Once loaded into R, I manipulated each of these raw datasets so as to limit them to the variables of interest, tracking dimensions of the dataframes as I went along.
#Define "Lowaccess" as a subset of "Lowaccess_raw" to include TRACTFIPS, COUNTYFIPS, STATEFIPS, STATE, COUNTY, REGION, ACS_TOT_POP_WT, FARA_TRACT_LILA_HALF_10, FARA_PCT_LA_HALF_10_POP
Lowaccess<- dplyr::select(Lowaccess_raw, TRACTFIPS, COUNTYFIPS, STATEFIPS, STATE, COUNTY, REGION, ACS_TOT_POP_WT, FARA_TRACT_LILA_HALF_10, FARA_PCT_LA_HALF_10_POP)
dim(Lowaccess) #74133 x 9## [1] 74133 9
#Define "Diabetes" as a subset of "Diabetes_raw" to include COUNTYFIPS, STATEFIPS, STATE, COUNTY, REGION, ACS_TOT_POP_WT, ACS_MEDIAN_AGE, ACS_PCT_EMPLOYED, ACS_MEDIAN_HH_INC, ACS_PCT_HH_INC_10000, ACS_PCT_HH_INC_100000, ACS_PER_CAPITA_INC, ACS_PCT_POV_AIAN, ACS_MEDIAN_HOME_VALUE, ACS_PCT_MEDICAID_ANY, ACS_PCT_MEDICAID_ANY_BELOW64, ACS_PCT_MEDICARE_ONLY, ACS_PCT_PRIVATE_ANY, ACS_PCT_UNINSURED, CEN_AIAN_NH_IND, ACS_TOT_HH, ACS_AVG_HH_SIZE, ACS_PCT_CHILD_1FAM, CHR_PCT_ADULT_OBESITY, CHR_PCT_DIABETES
Diabetes<- dplyr::select(Diabetes_raw, COUNTYFIPS, STATEFIPS, STATE, COUNTY, REGION, ACS_TOT_POP_WT, ACS_MEDIAN_AGE, ACS_PCT_EMPLOYED, ACS_MEDIAN_HH_INC, ACS_PCT_HH_INC_10000, ACS_PCT_HH_INC_100000, ACS_PER_CAPITA_INC, ACS_PCT_POV_AIAN, ACS_MEDIAN_HOME_VALUE, ACS_PCT_MEDICAID_ANY, ACS_PCT_MEDICAID_ANY_BELOW64, ACS_PCT_MEDICARE_ONLY, ACS_PCT_PRIVATE_ANY, ACS_PCT_UNINSURED, CEN_AIAN_NH_IND, ACS_TOT_HH, ACS_AVG_HH_SIZE, ACS_PCT_CHILD_1FAM, CHR_PCT_ADULT_OBESITY, CHR_PCT_DIABETES)
dim(Diabetes) #3234 x 25## [1] 3234 25
In anticipation of ultimately needing a county population value in order to calculate a weighted mean of FARA_PCT_LA_HALF_10_POP by county, I renamed ACS_TOT_POP_WT within the Diabetes dataframe to County_Pop. I then confirmed the column had been renamed and checked the dimensions of Diabetes to assure they remained unchanged.
Diabetes<- dplyr::rename(Diabetes, "County_Pop" = ACS_TOT_POP_WT)
head(Diabetes)## # A tibble: 6 × 25
## COUNTYFIPS STATE…¹ STATE COUNTY REGION Count…² ACS_M…³ ACS_P…⁴ ACS_M…⁵ ACS_P…⁶
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 01001 01 Alab… Autau… South 55221 37.7 92.4 51281 6.14
## 2 01003 01 Alab… Baldw… South 195121 42.2 92.5 50254 6.36
## 3 01005 01 Alab… Barbo… South 26932 38.8 82.4 32964 13.5
## 4 01007 01 Alab… Bibb … South 22604 38.9 91.7 38678 7.74
## 5 01009 01 Alab… Bloun… South 57710 40.7 92.3 45813 7.77
## 6 01011 01 Alab… Bullo… South 10678 39.3 82.0 31938 18.9
## # … with 15 more variables: ACS_PCT_HH_INC_100000 <dbl>,
## # ACS_PER_CAPITA_INC <dbl>, ACS_PCT_POV_AIAN <dbl>,
## # ACS_MEDIAN_HOME_VALUE <dbl>, ACS_PCT_MEDICAID_ANY <dbl>,
## # ACS_PCT_MEDICAID_ANY_BELOW64 <dbl>, ACS_PCT_MEDICARE_ONLY <dbl>,
## # ACS_PCT_PRIVATE_ANY <dbl>, ACS_PCT_UNINSURED <dbl>, CEN_AIAN_NH_IND <chr>,
## # ACS_TOT_HH <dbl>, ACS_AVG_HH_SIZE <dbl>, ACS_PCT_CHILD_1FAM <dbl>,
## # CHR_PCT_ADULT_OBESITY <dbl>, CHR_PCT_DIABETES <dbl>, and abbreviated …
dim(Diabetes) #3234 x 25## [1] 3234 25
I performed an innerjoin of Diabetes and Lowaccess by the variable that was common to both, COUNTYFIPS, recognizing that this step would only preserve rows that both dataframes had in common. I inspected the header of the joined dataframe and checked its dimensions again, noting the number of rows had been reduced from 74133 to 74132.
#Innerjoin Diabetes and Lowaccess
test_innerjoin<- inner_join(Diabetes, Lowaccess, by = "COUNTYFIPS")
head(test_innerjoin)## # A tibble: 6 × 33
## COUNTYFIPS STATEFIPS.x STATE.x COUNT…¹ REGIO…² Count…³ ACS_M…⁴ ACS_P…⁵ ACS_M…⁶
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 01001 01 Alabama Autaug… South 55221 37.7 92.4 51281
## 2 01001 01 Alabama Autaug… South 55221 37.7 92.4 51281
## 3 01001 01 Alabama Autaug… South 55221 37.7 92.4 51281
## 4 01001 01 Alabama Autaug… South 55221 37.7 92.4 51281
## 5 01001 01 Alabama Autaug… South 55221 37.7 92.4 51281
## 6 01001 01 Alabama Autaug… South 55221 37.7 92.4 51281
## # … with 24 more variables: ACS_PCT_HH_INC_10000 <dbl>,
## # ACS_PCT_HH_INC_100000 <dbl>, ACS_PER_CAPITA_INC <dbl>,
## # ACS_PCT_POV_AIAN <dbl>, ACS_MEDIAN_HOME_VALUE <dbl>,
## # ACS_PCT_MEDICAID_ANY <dbl>, ACS_PCT_MEDICAID_ANY_BELOW64 <dbl>,
## # ACS_PCT_MEDICARE_ONLY <dbl>, ACS_PCT_PRIVATE_ANY <dbl>,
## # ACS_PCT_UNINSURED <dbl>, CEN_AIAN_NH_IND <chr>, ACS_TOT_HH <dbl>,
## # ACS_AVG_HH_SIZE <dbl>, ACS_PCT_CHILD_1FAM <dbl>, …
dim(test_innerjoin) #74132 x 33## [1] 74132 33
I expected the one COUNTYFIPS discrepancy (Northern Mariana Islands, code 69085 as noted previously) to have been excluded from this innerjoined dataframe and checked for its presence.
#Visually inspecting to see if Northern Mariana (69085) is present in test_innerjoin
View(test_innerjoin)A search of the table above for 69085 yielded no results, so Northern Mariana had been excluded, as desired and as expected.
I then created a new variable to represent weighted population, census_tract_pop_wt, which is the tract population divided by the county population. Again, I inspected the header to verify the variable had been successfully added. I once again checked dimensions and as expected, noted the number of variables had increased by one, from 33 to 34.
test_innerjoin<- dplyr::mutate(test_innerjoin, "census_tract_pop_wt" = (ACS_TOT_POP_WT/County_Pop))
head(test_innerjoin)## # A tibble: 6 × 34
## COUNTYFIPS STATEFIPS.x STATE.x COUNT…¹ REGIO…² Count…³ ACS_M…⁴ ACS_P…⁵ ACS_M…⁶
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 01001 01 Alabama Autaug… South 55221 37.7 92.4 51281
## 2 01001 01 Alabama Autaug… South 55221 37.7 92.4 51281
## 3 01001 01 Alabama Autaug… South 55221 37.7 92.4 51281
## 4 01001 01 Alabama Autaug… South 55221 37.7 92.4 51281
## 5 01001 01 Alabama Autaug… South 55221 37.7 92.4 51281
## 6 01001 01 Alabama Autaug… South 55221 37.7 92.4 51281
## # … with 25 more variables: ACS_PCT_HH_INC_10000 <dbl>,
## # ACS_PCT_HH_INC_100000 <dbl>, ACS_PER_CAPITA_INC <dbl>,
## # ACS_PCT_POV_AIAN <dbl>, ACS_MEDIAN_HOME_VALUE <dbl>,
## # ACS_PCT_MEDICAID_ANY <dbl>, ACS_PCT_MEDICAID_ANY_BELOW64 <dbl>,
## # ACS_PCT_MEDICARE_ONLY <dbl>, ACS_PCT_PRIVATE_ANY <dbl>,
## # ACS_PCT_UNINSURED <dbl>, CEN_AIAN_NH_IND <chr>, ACS_TOT_HH <dbl>,
## # ACS_AVG_HH_SIZE <dbl>, ACS_PCT_CHILD_1FAM <dbl>, …
dim(test_innerjoin) #74132 x 34## [1] 74132 34
Having visually inspected my data, I noted that some tracts had NA values for FARA_PCT_LA_HALF_10_POP. Recognizing that my weighting calculations would produce an error if I attempted to perform them using NA values, I assessed to see how many NA values were present for this variable.
sum(is.na(test_innerjoin$FARA_PCT_LA_HALF_10_POP))## [1] 1625
Finding 1625 NA values, I removed these NA values from test_innerjoin so as to allow me to perform the weighting calculations.
test_innerjoin<- test_innerjoin %>% drop_na(FARA_PCT_LA_HALF_10_POP)I then inspected my updated test_innerjoin dataframe dimensions to verify that all 1625 NA values had been removed. Indeed, the rows had decreased from 74132 to 72507 as expected. Visual inspection of test_innerjoin confirmed no remaining NA values for FARA_PCT_LA_HALF_10_POP.
dim(test_innerjoin) ## [1] 72507 34
View(test_innerjoin) I created a new dataframe based on test_innerjoin that contained a newly-defined weighted mean variable, wm_FARA, using the weighted.mean function, which was the FARA_PCT_LA_HALF_10_POP for a tract multiplied by census_tract_pop_wt. I summarized wm_FARA and grouped results by COUNTYFIPS (one row per county). Other variables of interest for the analysis were included as well.
colnames(test_innerjoin)## [1] "COUNTYFIPS" "STATEFIPS.x"
## [3] "STATE.x" "COUNTY.x"
## [5] "REGION.x" "County_Pop"
## [7] "ACS_MEDIAN_AGE" "ACS_PCT_EMPLOYED"
## [9] "ACS_MEDIAN_HH_INC" "ACS_PCT_HH_INC_10000"
## [11] "ACS_PCT_HH_INC_100000" "ACS_PER_CAPITA_INC"
## [13] "ACS_PCT_POV_AIAN" "ACS_MEDIAN_HOME_VALUE"
## [15] "ACS_PCT_MEDICAID_ANY" "ACS_PCT_MEDICAID_ANY_BELOW64"
## [17] "ACS_PCT_MEDICARE_ONLY" "ACS_PCT_PRIVATE_ANY"
## [19] "ACS_PCT_UNINSURED" "CEN_AIAN_NH_IND"
## [21] "ACS_TOT_HH" "ACS_AVG_HH_SIZE"
## [23] "ACS_PCT_CHILD_1FAM" "CHR_PCT_ADULT_OBESITY"
## [25] "CHR_PCT_DIABETES" "TRACTFIPS"
## [27] "STATEFIPS.y" "STATE.y"
## [29] "COUNTY.y" "REGION.y"
## [31] "ACS_TOT_POP_WT" "FARA_TRACT_LILA_HALF_10"
## [33] "FARA_PCT_LA_HALF_10_POP" "census_tract_pop_wt"
test_innerjoin_wtmean<- test_innerjoin %>% dplyr::group_by(COUNTYFIPS, COUNTY.x, STATE.x, REGION.x, County_Pop, ACS_MEDIAN_HH_INC, ACS_PER_CAPITA_INC, ACS_PCT_POV_AIAN, ACS_MEDIAN_HOME_VALUE, ACS_PCT_UNINSURED, CHR_PCT_ADULT_OBESITY, CHR_PCT_DIABETES) %>% dplyr::summarise(wm_FARA = weighted.mean(FARA_PCT_LA_HALF_10_POP, census_tract_pop_wt)) %>% as.data.frame()## `summarise()` has grouped output by 'COUNTYFIPS', 'COUNTY.x', 'STATE.x',
## 'REGION.x', 'County_Pop', 'ACS_MEDIAN_HH_INC', 'ACS_PER_CAPITA_INC',
## 'ACS_PCT_POV_AIAN', 'ACS_MEDIAN_HOME_VALUE', 'ACS_PCT_UNINSURED',
## 'CHR_PCT_ADULT_OBESITY'. You can override using the `.groups` argument.
Once again, I inspected this new dataframe to verify that I had created wm_FARA successfully and that each county had a single row in the dataframe. I again checked the dimensions of this new dataframe.
head(test_innerjoin_wtmean)## COUNTYFIPS COUNTY.x STATE.x REGION.x County_Pop ACS_MEDIAN_HH_INC
## 1 01001 Autauga County Alabama South 55221 51281
## 2 01003 Baldwin County Alabama South 195121 50254
## 3 01005 Barbour County Alabama South 26932 32964
## 4 01007 Bibb County Alabama South 22604 38678
## 5 01009 Blount County Alabama South 57710 45813
## 6 01011 Bullock County Alabama South 10678 31938
## ACS_PER_CAPITA_INC ACS_PCT_POV_AIAN ACS_MEDIAN_HOME_VALUE ACS_PCT_UNINSURED
## 1 24974 52.20 141300 10.12
## 2 27317 9.15 169300 12.96
## 3 16824 0.00 92200 15.51
## 4 18431 4.65 102700 9.66
## 5 20532 6.60 119800 11.64
## 6 17580 0.00 68600 18.09
## CHR_PCT_ADULT_OBESITY CHR_PCT_DIABETES wm_FARA
## 1 37.5 14.2 54.81617
## 2 31.0 11.3 40.21763
## 3 44.3 18.0 31.32199
## 4 37.8 14.9 1.18436
## 5 34.4 14.3 13.44140
## 6 39.4 20.0 70.79069
dim(test_innerjoin_wtmean) #3140 x 13## [1] 3140 13
I noted that while the earlier county-level dataframe (Diabetes) had 3234 rows, test_innerjoin_wtmean only had 3140 rows. I went about an analysis to understand why there was a difference in row counts starting with setdiff.
setdiff(Diabetes$COUNTYFIPS, test_innerjoin_wtmean$COUNTYFIPS)## [1] "02158" "46102" "60000" "60010" "60020" "60030" "60040" "60050" "66010"
## [10] "69000" "69100" "69110" "69120" "72001" "72003" "72005" "72007" "72009"
## [19] "72011" "72013" "72015" "72017" "72019" "72021" "72023" "72025" "72027"
## [28] "72029" "72031" "72033" "72035" "72037" "72039" "72041" "72043" "72045"
## [37] "72047" "72049" "72051" "72053" "72054" "72055" "72057" "72059" "72061"
## [46] "72063" "72065" "72067" "72069" "72071" "72073" "72075" "72077" "72079"
## [55] "72081" "72083" "72085" "72087" "72089" "72091" "72093" "72095" "72097"
## [64] "72099" "72101" "72103" "72105" "72107" "72109" "72111" "72113" "72115"
## [73] "72117" "72119" "72121" "72123" "72125" "72127" "72129" "72131" "72133"
## [82] "72135" "72137" "72139" "72141" "72143" "72145" "72147" "72149" "72151"
## [91] "72153" "78010" "78020" "78030"
There were 94 counties in Diabetes that were not in test_innerjoin_wtmean. Using “View(Diabetes)” I visually inspected the Diabetes table again to look for a pattern among these 94 counties in question.
These counties appeared to be in Alaska and U.S. territories. Examples included: 02158 (Kusilvak Census Area, AK), 60010 (Am Samoa), 72151 (Puerto Rico), 78030 (US Virgin Islands). Excluding these counties from test_innerjoin_wtmean was acceptable given that I did not have the ability to plot choropleths (see counties RDS file below) with these non-continental U.S. counties.
I next began preparing for the exploratory analysis by loading various necessary packages first and then loading county geography data.
library(rgdal)## Warning: package 'rgdal' was built under R version 4.2.2
## Loading required package: sp
## Please note that rgdal will be retired during 2023,
## plan transition to sf/stars/terra functions using GDAL and PROJ
## at your earliest convenience.
## See https://r-spatial.org/r/2022/04/12/evolution.html and https://github.com/r-spatial/evolution
## rgdal: version: 1.6-2, (SVN revision 1183)
## Geospatial Data Abstraction Library extensions to R successfully loaded
## Loaded GDAL runtime: GDAL 3.5.2, released 2022/09/02
## Path to GDAL shared files: C:/Users/McGreevJ/AppData/Local/Programs/R/R-4.2.1/library/rgdal/gdal
## GDAL binary built with GEOS: TRUE
## Loaded PROJ runtime: Rel. 8.2.1, January 1st, 2022, [PJ_VERSION: 821]
## Path to PROJ shared files: C:/Users/McGreevJ/AppData/Local/Programs/R/R-4.2.1/library/rgdal/proj
## PROJ CDN enabled: FALSE
## Linking to sp version:1.5-1
## To mute warnings of possible GDAL/OSR exportToProj4() degradation,
## use options("rgdal_show_exportToProj4_warnings"="none") before loading sp or rgdal.
library(leaflet)## Warning: package 'leaflet' was built under R version 4.2.2
library(ggplot2)
library(RColorBrewer)
library(plyr)## Warning: package 'plyr' was built under R version 4.2.2
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following object is masked from 'package:purrr':
##
## compact
library(sf)## Warning: package 'sf' was built under R version 4.2.2
## Linking to GEOS 3.9.3, GDAL 3.5.2, PROJ 8.2.1; sf_use_s2() is TRUE
library(tidycensus)## Warning: package 'tidycensus' was built under R version 4.2.2
library(ggsn)## Warning: package 'ggsn' was built under R version 4.2.2
## Loading required package: grid
counties<- readRDS(gzcon(url("https://raw.githubusercontent.com/HimesGroup/BMIN503/master/DataFiles/uscounties_2010.rds")))
st_crs(counties) <- st_crs(counties)## old-style crs object detected; please recreate object with a recent sf::st_crs()
## old-style crs object detected; please recreate object with a recent sf::st_crs()
#Using base plot function to check that counties contains polygon data:
plot(counties)#Using ggplot to check that counties contains polygon data:
ggplot() +
geom_sf(data = counties)Here I began the process of joining test_innerjoin_wtmean and counties but first needed to create a common variable by which to join. I converted State to a 2 digit number in counties, then concatenated State and County into a 5 digit number, a mutated variable in counties called “COUNTYFIPS” to correspond to “COUNTYFIPS” in the combined data set, test_innerjoin_wtmean.
counties$STATE<- str_pad(counties$STATE, width=2, side="left", pad="0")
head(counties)## Simple feature collection with 6 features and 6 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -88.13999 ymin: 31.19518 xmax: -85.12342 ymax: 34.8922
## Geodetic CRS: WGS 84
## GEO_ID STATE COUNTY NAME LSAD CENSUSAREA
## 1 0500000US01001 01 001 Autauga County 594.436
## 2 0500000US01009 01 009 Blount County 644.776
## 3 0500000US01017 01 017 Chambers County 596.531
## 4 0500000US01021 01 021 Chilton County 692.854
## 5 0500000US01033 01 033 Colbert County 592.619
## 6 0500000US01045 01 045 Dale County 561.150
## geometry
## 1 MULTIPOLYGON (((-86.49677 3...
## 2 MULTIPOLYGON (((-86.5778 33...
## 3 MULTIPOLYGON (((-85.18413 3...
## 4 MULTIPOLYGON (((-86.51734 3...
## 5 MULTIPOLYGON (((-88.13999 3...
## 6 MULTIPOLYGON (((-85.41644 3...
counties<- counties %>% dplyr::mutate(COUNTYFIPS = paste(STATE, COUNTY, sep = ""))
head(counties)## Simple feature collection with 6 features and 7 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -88.13999 ymin: 31.19518 xmax: -85.12342 ymax: 34.8922
## Geodetic CRS: WGS 84
## GEO_ID STATE COUNTY NAME LSAD CENSUSAREA
## 1 0500000US01001 01 001 Autauga County 594.436
## 2 0500000US01009 01 009 Blount County 644.776
## 3 0500000US01017 01 017 Chambers County 596.531
## 4 0500000US01021 01 021 Chilton County 692.854
## 5 0500000US01033 01 033 Colbert County 592.619
## 6 0500000US01045 01 045 Dale County 561.150
## geometry COUNTYFIPS
## 1 MULTIPOLYGON (((-86.49677 3... 01001
## 2 MULTIPOLYGON (((-86.5778 33... 01009
## 3 MULTIPOLYGON (((-85.18413 3... 01017
## 4 MULTIPOLYGON (((-86.51734 3... 01021
## 5 MULTIPOLYGON (((-88.13999 3... 01033
## 6 MULTIPOLYGON (((-85.41644 3... 01045
Also before joining test_innerjoin_wtmean and counties, I inspected both to verify that the number of COUNTYFIPS codes were the same in each. This process yielded the following results:
counties %>% summarise(n_distinct(COUNTYFIPS)) #3109## n_distinct(COUNTYFIPS)
## 1 3109
test_innerjoin_wtmean %>% summarise(n_distinct(COUNTYFIPS)) #3140## n_distinct(COUNTYFIPS)
## 1 3140
So 31 rows appear to be different between counties and test_innerjoin_wtmean in terms of COUNTYFIPS values
setdiff(counties$COUNTYFIPS, test_innerjoin_wtmean$COUNTYFIPS)## [1] "46113" "51515"
Yields 2 COUNTYFIPS, 46113 and 51515 that are different, both present in counties but not in test_innerjoin_wtmean
These 2 COUNTYFIPS are:
Shannon County, SD (FIPS code = 46113). After investigation, this county was renamed Oglala Lakota County and assigned a new FIPS code (46102) effective in 2014.
Bedford City, VA (FIPS code=51515). After investigation, I discovered that in 2013, Bedford City, an independent city, merged with Bedford County (FIPS code=51019). Apparently, Bedford City category codes are the same as those for Bedford County.
However, this prompted me to ask “what about the other rows that are different? What are those rows?”
Using the alternate sequence for setdiff, with test_innerjoin_wt mean first:
setdiff(test_innerjoin_wtmean$COUNTYFIPS, counties$COUNTYFIPS)## [1] "02013" "02016" "02020" "02050" "02060" "02068" "02070" "02090" "02100"
## [10] "02105" "02110" "02122" "02130" "02150" "02164" "02170" "02180" "02185"
## [19] "02188" "02195" "02198" "02220" "02230" "02240" "02261" "02275" "02282"
## [28] "02290" "15001" "15003" "15005" "15007" "15009"
Therefore there were 33 differences, which represent COUNTYFIPS that are present in test_innerjoin_wtmean but not present in counties.
These differences are for Alaska (State code 02) and Hawaii (State code 15). These states are not represented in counties.
Based on this analysis, I concluded that an innerjoin of counties and test_innerjoin_wtmean would likely omit the Hawaii and Alaska data that is in test_innerjoin_wtmean. For purposes of this project, this omission would be acceptable as my choropleth creation tools are presently limited to the continental United States.
Given the conclusion above, I felt it was safe to proceed with an innerjoin of counties and test_innerjoin_wtmean by COUNTYFIPS, to create a new dataframe, test_counties_inner.
test_counties_inner<- inner_join(counties, test_innerjoin_wtmean, by = 'COUNTYFIPS')I inspected test_counties_inner to verify that Shannon County, Bedford City, Hawaii, and Alaska rows (as described above) had been removed as expected.
str(test_counties_inner) ## Classes 'sf' and 'data.frame': 3107 obs. of 20 variables:
## $ GEO_ID : Factor w/ 3221 levels "0500000US01001",..: 1 5 9 11 17 23 26 33 40 42 ...
## $ STATE : chr "01" "01" "01" "01" ...
## $ COUNTY : Factor w/ 325 levels "001","003","005",..: 1 6 13 16 23 30 34 43 53 55 ...
## $ NAME : Factor w/ 1909 levels "Abbeville","Acadia",..: 89 174 309 341 387 460 567 730 986 1009 ...
## $ LSAD : Factor w/ 8 levels "Borough","CA",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ CENSUSAREA : num 594 645 597 693 593 ...
## $ COUNTYFIPS : chr "01001" "01009" "01017" "01021" ...
## $ COUNTY.x : chr "Autauga County" "Blount County" "Chambers County" "Chilton County" ...
## $ STATE.x : chr "Alabama" "Alabama" "Alabama" "Alabama" ...
## $ REGION.x : chr "South" "South" "South" "South" ...
## $ County_Pop : num 55221 57710 34079 43819 54444 ...
## $ ACS_MEDIAN_HH_INC : num 51281 45813 34177 41627 40576 ...
## $ ACS_PER_CAPITA_INC : num 24974 20532 21071 21399 22546 ...
## $ ACS_PCT_POV_AIAN : num 52.2 6.6 36.84 6.33 18.24 ...
## $ ACS_MEDIAN_HOME_VALUE: num 141300 119800 80800 100100 99400 ...
## $ ACS_PCT_UNINSURED : num 10.1 11.6 13.9 16 11.1 ...
## $ CHR_PCT_ADULT_OBESITY: num 37.5 34.4 40.3 35.3 32.5 37.3 36.1 42 35 33.6 ...
## $ CHR_PCT_DIABETES : num 14.2 14.3 17 14.7 17.6 15.6 14.6 16.1 14.9 12.4 ...
## $ wm_FARA : num 54.8 13.4 30.7 13.9 52.8 ...
## $ geometry :sfc_MULTIPOLYGON of length 3107; first list element: List of 1
## ..$ :List of 1
## .. ..$ : num [1:10, 1:2] -86.5 -86.7 -86.8 -86.9 -86.9 ...
## ..- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
## - attr(*, "sf_column")= chr "geometry"
## - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA NA NA NA ...
## ..- attr(*, "names")= chr [1:19] "GEO_ID" "STATE" "COUNTY" "NAME" ...
View(test_counties_inner)Inspection revealed 3107 observations of 20 variables and confirmed that Hawaii and Alaska data, as well as Shannon County, SD and Bedford City, VA data, had been removed.
I began my exploratory analysis, first by inspecting column names in test_counties_inner to verify that all variables needed for this analysis were present.
colnames(test_counties_inner)## [1] "GEO_ID" "STATE" "COUNTY"
## [4] "NAME" "LSAD" "CENSUSAREA"
## [7] "COUNTYFIPS" "COUNTY.x" "STATE.x"
## [10] "REGION.x" "County_Pop" "ACS_MEDIAN_HH_INC"
## [13] "ACS_PER_CAPITA_INC" "ACS_PCT_POV_AIAN" "ACS_MEDIAN_HOME_VALUE"
## [16] "ACS_PCT_UNINSURED" "CHR_PCT_ADULT_OBESITY" "CHR_PCT_DIABETES"
## [19] "wm_FARA" "geometry"
Before beginning the exploratory analysis, I also double-checked to verify there were no counties that were missing low food access (wm_FARA) values. I determined that in fact, there were no counties missing wm_FARA values.
y2<-test_counties_inner$wm_FARA
countmissinglowaccess<-length(which(is.na(y2)))
countmissinglowaccess ## [1] 0
Also before beginning the exploratory analysis, I verified that no counties were missing values for diabetes prevalence.
y_dm<-test_counties_inner$CHR_PCT_DIABETES
countmissingdm<-length(which(is.na(y_dm)))
countmissingdm ## [1] 0
I began the exploratory analysis by generating summary statistics for the low access data, notably a mean and a median low access (wm_FARA) percentage across continental U.S. counties.
Mean low access percentage
mean_LA<- test_counties_inner %>% dplyr::summarise(Mean_LA_County = mean(wm_FARA, na.rm = TRUE))
mean_LA ## Simple feature collection with 1 feature and 1 field
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -124.7332 ymin: 24.5447 xmax: -66.9499 ymax: 49.38436
## Geodetic CRS: WGS 84
## Mean_LA_County geometry
## 1 38.1614 MULTIPOLYGON (((-85.56778 4...
Therefore mean county low access percentage across continental U.S. counties was 38.16%.
Median low access percentage
median_LA<- test_counties_inner %>% dplyr::summarise(Median_LA_County = median(wm_FARA, na.rm = TRUE))
median_LA ## Simple feature collection with 1 feature and 1 field
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -124.7332 ymin: 24.5447 xmax: -66.9499 ymax: 49.38436
## Geodetic CRS: WGS 84
## Median_LA_County geometry
## 1 36.44304 MULTIPOLYGON (((-85.56778 4...
Therefore median county low access percentage across continental U.S. counties was 36.44%.
Next I calculated the number of counties that had 100% of their population with low access to food and identified them.
y<-test_counties_inner$wm_FARA
counttoplowaccess<-length(which(y == 100))
counttoplowaccess ## [1] 55
Therefore 55 counties had 100% of the population with low access to food.
Here are the identities of those 100% low access counties.
Lowest_access_counties<- test_counties_inner %>% dplyr::select(COUNTYFIPS, REGION.x, STATE.x, COUNTY.x, wm_FARA) %>% slice_max(wm_FARA)
Lowest_access_counties## Simple feature collection with 55 features and 5 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -120.4952 ymin: 29.6235 xmax: -79.31228 ymax: 48.99902
## Geodetic CRS: WGS 84
## First 10 features:
## COUNTYFIPS REGION.x STATE.x COUNTY.x wm_FARA
## 1 20075 Midwest Kansas Hamilton County 100
## 2 20203 Midwest Kansas Wichita County 100
## 3 30037 West Montana Golden Valley County 100
## 4 31183 Midwest Nebraska Wheeler County 100
## 5 30059 West Montana Meagher County 100
## 6 30069 West Montana Petroleum County 100
## 7 30107 West Montana Wheatland County 100
## 8 31009 Midwest Nebraska Blaine County 100
## 9 31085 Midwest Nebraska Hayes County 100
## 10 31117 Midwest Nebraska McPherson County 100
## geometry
## 1 MULTIPOLYGON (((-101.5675 3...
## 2 MULTIPOLYGON (((-101.5671 3...
## 3 MULTIPOLYGON (((-108.7801 4...
## 4 MULTIPOLYGON (((-98.29576 4...
## 5 MULTIPOLYGON (((-110.2819 4...
## 6 MULTIPOLYGON (((-108.6309 4...
## 7 MULTIPOLYGON (((-109.6543 4...
## 8 MULTIPOLYGON (((-99.68696 4...
## 9 MULTIPOLYGON (((-100.7774 4...
## 10 MULTIPOLYGON (((-101.2697 4...
View(Lowest_access_counties)I then calculated the counties where greater than 75% of the population had low food access and identified these counties.
yseventyfive<- test_counties_inner$wm_FARA
countseventyfive<- length(which(yseventyfive > 75))
countseventyfive ## [1] 165
Therefore in 165 counties greater than 75% of the population had low food access.
Here are those at least 75% of the population with low access counties identified.
Topquartile<- test_counties_inner %>% dplyr::select(COUNTYFIPS, REGION.x, STATE.x, COUNTY.x, wm_FARA) %>% filter(wm_FARA > 75)
Topquartile## Simple feature collection with 165 features and 5 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -123.7279 ymin: 26.06684 xmax: -71.2248 ymax: 49.00027
## Geodetic CRS: WGS 84
## First 10 features:
## COUNTYFIPS REGION.x STATE.x COUNTY.x wm_FARA
## 1 08014 West Colorado Broomfield County 79.02029
## 2 12111 South Florida St. Lucie County 78.43846
## 3 13059 South Georgia Clarke County 76.37750
## 4 08111 West Colorado San Juan County 99.24000
## 5 20049 Midwest Kansas Elk County 99.36000
## 6 20075 Midwest Kansas Hamilton County 100.00000
## 7 20203 Midwest Kansas Wichita County 100.00000
## 8 30037 West Montana Golden Valley County 100.00000
## 9 31163 Midwest Nebraska Sherman County 83.43000
## 10 31171 Midwest Nebraska Thomas County 99.94000
## geometry
## 1 MULTIPOLYGON (((-105.1473 3...
## 2 MULTIPOLYGON (((-80.67786 2...
## 3 MULTIPOLYGON (((-83.36003 3...
## 4 MULTIPOLYGON (((-107.7383 3...
## 5 MULTIPOLYGON (((-96.52569 3...
## 6 MULTIPOLYGON (((-101.5675 3...
## 7 MULTIPOLYGON (((-101.5671 3...
## 8 MULTIPOLYGON (((-108.7801 4...
## 9 MULTIPOLYGON (((-98.74853 4...
## 10 MULTIPOLYGON (((-100.8461 4...
View(Topquartile)Using the 75% of population with low access threshold above, I then determined which regions had the lowest food access.
Topregions<- Topquartile %>% dplyr::summarise(count(REGION.x))
Topregions## Simple feature collection with 4 features and 2 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -123.7279 ymin: 26.06684 xmax: -71.2248 ymax: 49.00027
## Geodetic CRS: WGS 84
## x freq geometry
## 1 Midwest 67 MULTIPOLYGON (((-101.2277 4...
## 2 Northeast 1 MULTIPOLYGON (((-101.2277 4...
## 3 South 55 MULTIPOLYGON (((-101.2277 4...
## 4 West 42 MULTIPOLYGON (((-101.2277 4...
Topregions_summary<- dplyr::arrange(Topregions, desc(freq))
Topregions_summary## Simple feature collection with 4 features and 2 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -123.7279 ymin: 26.06684 xmax: -71.2248 ymax: 49.00027
## Geodetic CRS: WGS 84
## x freq geometry
## 1 Midwest 67 MULTIPOLYGON (((-101.2277 4...
## 2 South 55 MULTIPOLYGON (((-101.2277 4...
## 3 West 42 MULTIPOLYGON (((-101.2277 4...
## 4 Northeast 1 MULTIPOLYGON (((-101.2277 4...
As noted in the table, ranked by number of counties with >75% of population with low food access: Midwest had 67 counties; South had 55 counties; West had 42 counties, Northeast had 1 county, for 165 counties total.
I calculated how many counties have 0% of their population with low food access and identified them.
y1<-test_counties_inner$wm_FARA
countleastlowaccess<-length(which(y1 == 0))
countleastlowaccess ## [1] 38
Therefore 38 counties have 0% of the population with low access to food.
These 38 counties where 0% of the population had low access to food were identified.
Highest_access_counties<- test_counties_inner %>% dplyr::select(COUNTYFIPS, REGION.x, STATE.x, COUNTY.x, wm_FARA) %>% slice_min(wm_FARA)
Highest_access_counties## Simple feature collection with 38 features and 5 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -105.6903 ymin: 31.43369 xmax: -76.24528 ymax: 45.21074
## Geodetic CRS: WGS 84
## First 10 features:
## COUNTYFIPS REGION.x STATE.x COUNTY.x wm_FARA
## 1 08047 West Colorado Gilpin County 0
## 2 17035 Midwest Illinois Cumberland County 0
## 3 18115 Midwest Indiana Ohio County 0
## 4 21077 South Kentucky Gallatin County 0
## 5 37103 South North Carolina Jones County 0
## 6 47057 South Tennessee Grainger County 0
## 7 48379 South Texas Rains County 0
## 8 48425 South Texas Somervell County 0
## 9 13109 South Georgia Evans County 0
## 10 13119 South Georgia Franklin County 0
## geometry
## 1 MULTIPOLYGON (((-105.3978 3...
## 2 MULTIPOLYGON (((-88.01212 3...
## 3 MULTIPOLYGON (((-84.84945 3...
## 4 MULTIPOLYGON (((-84.66011 3...
## 5 MULTIPOLYGON (((-77.47372 3...
## 6 MULTIPOLYGON (((-83.66746 3...
## 7 MULTIPOLYGON (((-95.66539 3...
## 8 MULTIPOLYGON (((-97.86486 3...
## 9 MULTIPOLYGON (((-81.96907 3...
## 10 MULTIPOLYGON (((-83.35527 3...
View(Highest_access_counties)I continued the exploratory analysis by generating summary statistics about the diabetes prevalence percentage (CHR_PCT_DIABETES) data.
Mean percentage prevalence of population with diabetes across U.S. counties.
mean_DM<- test_counties_inner %>% dplyr::summarise(Mean_DM_County = mean(CHR_PCT_DIABETES, na.rm = TRUE))
mean_DM ## Simple feature collection with 1 feature and 1 field
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -124.7332 ymin: 24.5447 xmax: -66.9499 ymax: 49.38436
## Geodetic CRS: WGS 84
## Mean_DM_County geometry
## 1 11.6589 MULTIPOLYGON (((-85.56778 4...
Therefore across continental U.S. counties, the mean percentage of the population with diabetes was 11.66%.
Median percentage prevalence of population with diabetes across U.S. counties.
median_DM<- test_counties_inner %>% dplyr::summarise(Median_DM_County = median(CHR_PCT_DIABETES, na.rm = TRUE))
median_DM ## Simple feature collection with 1 feature and 1 field
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -124.7332 ymin: 24.5447 xmax: -66.9499 ymax: 49.38436
## Geodetic CRS: WGS 84
## Median_DM_County geometry
## 1 11.5 MULTIPOLYGON (((-85.56778 4...
Therefore across continental U.S. counties, median percentage of population with diabetes was 11.5%.
I identified the top 100 counties in terms of greatest percentage of population with diabetes in 2015.
Diabetes_high<- test_counties_inner %>% dplyr::select(COUNTYFIPS, REGION.x, STATE.x, COUNTY.x, CHR_PCT_DIABETES) %>% slice_max(CHR_PCT_DIABETES, n = 100)
View(Diabetes_high) Note that the search for the top 100 actually returned 109 counties given that a number of counties were tied in their diabetes prevalence percentages.
I determined what regions were represented by these top 109 counties with the greatest percentage of population with diabetes.
Topregions_DM<- Diabetes_high %>% dplyr::summarize(count(REGION.x) %>% dplyr::arrange(desc(freq)))
View(Topregions_DM)Therefore 105 of the top 109 counties in terms of greatest percentages of population with diabetes were in the South. 4 were in the Midwest.
I determined the 100 counties with the lowest percentage of their population with diabetes.
Diabetes_low<- test_counties_inner %>% dplyr::select(COUNTYFIPS, REGION.x, STATE.x, COUNTY.x, CHR_PCT_DIABETES) %>% slice_min(CHR_PCT_DIABETES, n = 100)
View(Diabetes_low) Note that the search for the 100 counties with the lowest percentage of their populations with diabetes actually returned 111 entries given a number of counties were tied in their population percentages with diabetes.
I determined what regions were represented by these 111 counties with the lowest percentage of population with diabetes.
Bottomregions_DM<- Diabetes_low %>% dplyr::summarize(count(REGION.x) %>% dplyr::arrange(desc(freq)))
View(Bottomregions_DM)Therefore, in rank order by lowest percentage of population with diabetes among these 111 counties: West (69 counties), Midwest (25 counties), Northeast (9 counties), South (8 counties)
Here I prepared for another component of the exploratory analysis, choropleth and leaflet creation.
library(RColorBrewer)
my_theme <- function() {
theme_minimal() +
theme(axis.line = element_blank(),
axis.text = element_blank(),
axis.title = element_blank(),
panel.grid = element_line(color = "white"),
legend.key.size = unit(0.5, "cm"),
legend.text = element_text(size = 10),
legend.title = element_text(size = 12),
plot.title = element_text(size = 16))
}
myPalette <- colorRampPalette(brewer.pal(9, "YlOrRd"))
library(leaflet)
pal_fun4 <- colorNumeric("inferno", NULL) # inferno from viridisI generated a low food access choropleth.
choro_lowaccess<- ggplot() +
geom_sf(data = test_counties_inner, aes(fill = wm_FARA), lwd = 0) +
my_theme() +
ggtitle("Percent of population with low food access, by U.S. county, 2015") +
scale_fill_continuous(type = "viridis", guide = "colorbar") +
north(x.min = -75.28031, y.min = 39.86747, # add north bar with north()
x.max = -74.95575, y.max = 40.13793, # set map boundaries with x.min, y.min, etc..
symbol=12, # select north arrow symbol
anchor = c(x = -75, y = 39.93)) + # set north bar location
scalebar(x.min = -75.28031, y.min = 39.86747, # add scalebar with scalebar()
x.max = -74.95575, y.max = 40.13793, # set map boundaries
dist = 5, dist_unit = 'km', # set scalebar segment length = 5km
transform = TRUE, # TRUE for decimal degree coordinates
model = "WGS84") # specify CR
choro_lowaccessI generated a diabetes prevalence percentage choropleth.
choro_diabetes<- ggplot() +
geom_sf(data = test_counties_inner, aes(fill = CHR_PCT_DIABETES), lwd = 0) +
my_theme() +
ggtitle("Percent of population with diabetes, by U.S. county, 2015") +
scale_fill_continuous(type = "viridis", guide = "colorbar") +
north(x.min = -75.28031, y.min = 39.86747, # add north bar with north()
x.max = -74.95575, y.max = 40.13793, # set map boundaries with x.min, y.min, etc..
symbol=12, # select north arrow symbol
anchor = c(x = -75, y = 39.93)) + # set north bar location
scalebar(x.min = -75.28031, y.min = 39.86747, # add scalebar with scalebar()
x.max = -74.95575, y.max = 40.13793, # set map boundaries
dist = 5, dist_unit = 'km', # set scalebar segment length = 5km
transform = TRUE, # TRUE for decimal degree coordinates
model = "WGS84") # specify CR
choro_diabetesI created a leaflet for low food access.
#Popup message - low access
pu_message1 <- paste0(test_counties_inner$COUNTY.x, ", ", test_counties_inner$STATE.x, "<br>Low food access percentage 2015: ", round(test_counties_inner$wm_FARA, 1), "%")
#Interactive map with legend and scalebar
leaflet_lowaccess<-leaflet(test_counties_inner) %>%
addPolygons(fillColor = ~pal_fun4(wm_FARA),
popup = pu_message1) %>%
addTiles() %>% addLegend("bottomright", pal=pal_fun4, values = ~wm_FARA,
title = '2015 Low food access percentages by U.S. county',
opacity = 1) %>% addScaleBar()
leaflet_lowaccessI created a leaflet for diabetes prevalence.
#Popup message - diabetes
pu_message2 <- paste0(test_counties_inner$COUNTY.x, ", ", test_counties_inner$STATE.x, "<br>Diabetes prevalence 2015: ", round(test_counties_inner$CHR_PCT_DIABETES, 1), "%")
#Interactive map with legend and scalebar
leaflet_diabetes<- leaflet(test_counties_inner) %>%
addPolygons(fillColor = ~pal_fun4(CHR_PCT_DIABETES),
popup = pu_message2) %>%
addTiles() %>% addLegend("bottomright", pal=pal_fun4, values = ~CHR_PCT_DIABETES,
title = '2015 Diabetes prevalence rates by U.S. county',
opacity = 1) %>% addScaleBar()
leaflet_diabetesNext, I generated a scatterplot with smooth line showing the relationship between low food access (wm_FARA) and diabetes prevalence percentage (CHR_PCT_DIABETES).
library(modelsummary)##
## Attaching package: 'modelsummary'
## The following object is masked from 'package:tidycensus':
##
## get_estimates
ggplot(test_counties_inner, aes(x = wm_FARA, y = CHR_PCT_DIABETES)) +
geom_point(color = "darkseagreen3") +
geom_smooth(method = "lm", color = "green4") + ggtitle("Scatterplot: Low food access vs. Diabetes prevalence")## `geom_smooth()` using formula 'y ~ x'
I calculated the Pearson correlation coefficient for the above relationship, which showed a mild to moderate negative relationship between wm_FARA and CHR_PCT_DIABETES.
cor(test_counties_inner$wm_FARA, test_counties_inner$CHR_PCT_DIABETES, use = "complete.obs") ## [1] -0.3060672
I then created an unadjusted linear regression model to further assess the relationship between wm_FARA and CHR_PCT_DIABETES.
LA_DM_lm_fit <- lm(CHR_PCT_DIABETES ~ wm_FARA, data = test_counties_inner)
LA_DM_lm_fit##
## Call:
## lm(formula = CHR_PCT_DIABETES ~ wm_FARA, data = test_counties_inner)
##
## Coefficients:
## (Intercept) wm_FARA
## 12.93230 -0.03337
I then summarized this linear model, LA_DM_lm_fit. Findings indicate a very small negative relationship that was statistically significant between wm_FARA and CHR_PCT_DIABETES.
summary_LA_DM_lm_fit <- summary.lm(LA_DM_lm_fit) #Summary of results
summary_LA_DM_lm_fit##
## Call:
## lm(formula = CHR_PCT_DIABETES ~ wm_FARA, data = test_counties_inner)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.7474 -1.7094 -0.0926 1.5925 9.4299
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.932301 0.083690 154.53 <2e-16 ***
## wm_FARA -0.033369 0.001863 -17.91 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.462 on 3105 degrees of freedom
## Multiple R-squared: 0.09368, Adjusted R-squared: 0.09339
## F-statistic: 320.9 on 1 and 3105 DF, p-value: < 2.2e-16
Next, I explored potentially confounding variables in the relationship between wm_FARA and CHR_PCT_DIABETES. I started by creating scatterplots showing the relationship between four variables - median household income, per capita income, obesity percentage by county, and percentage of a county population that was uninsured - with diabetes prevalence. I also created linear models to assess the relationship between each of these variables and diabetes prevalence.
Median household income
plot_medhhincome<-ggplot(test_counties_inner, aes(x = ACS_MEDIAN_HH_INC, y = CHR_PCT_DIABETES)) +
geom_point(color = "sienna4") + ggtitle("Med. HH income r/t to diabetes")
summary((lm(CHR_PCT_DIABETES ~ ACS_MEDIAN_HH_INC, data = test_counties_inner)))##
## Call:
## lm(formula = CHR_PCT_DIABETES ~ ACS_MEDIAN_HH_INC, data = test_counties_inner)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.4296 -1.3736 -0.0069 1.3735 7.4564
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.729e+01 1.524e-01 113.42 <2e-16 ***
## ACS_MEDIAN_HH_INC -1.207e-04 3.163e-06 -38.16 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.134 on 3104 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.3193, Adjusted R-squared: 0.3191
## F-statistic: 1456 on 1 and 3104 DF, p-value: < 2.2e-16
plot_medhhincome## Warning: Removed 1 rows containing missing values (geom_point).
Per capita income
plot_percapincome<-ggplot(test_counties_inner, aes(x = ACS_PER_CAPITA_INC, y = CHR_PCT_DIABETES)) +
geom_point(color = "sienna4") + ggtitle("Per capita income r/t diabetes")
summary((lm(CHR_PCT_DIABETES ~ ACS_PER_CAPITA_INC, data = test_counties_inner)))##
## Call:
## lm(formula = CHR_PCT_DIABETES ~ ACS_PER_CAPITA_INC, data = test_counties_inner)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.8249 -1.4033 0.0386 1.3774 8.5143
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.752e+01 1.693e-01 103.47 <2e-16 ***
## ACS_PER_CAPITA_INC -2.413e-04 6.786e-06 -35.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.18 on 3105 degrees of freedom
## Multiple R-squared: 0.2895, Adjusted R-squared: 0.2892
## F-statistic: 1265 on 1 and 3105 DF, p-value: < 2.2e-16
plot_percapincomeObesity
plot_obesity<-ggplot(test_counties_inner, aes(x = CHR_PCT_ADULT_OBESITY, y = CHR_PCT_DIABETES)) +
geom_point(color = "sienna4") +
ggtitle("Obesity pct r/t to diabetes")
summary((lm(CHR_PCT_DIABETES ~ CHR_PCT_ADULT_OBESITY, data = test_counties_inner)))##
## Call:
## lm(formula = CHR_PCT_DIABETES ~ CHR_PCT_ADULT_OBESITY, data = test_counties_inner)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.1775 -1.2811 -0.0812 1.2155 7.1327
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.507953 0.243063 -2.09 0.0367 *
## CHR_PCT_ADULT_OBESITY 0.379292 0.007501 50.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.915 on 3105 degrees of freedom
## Multiple R-squared: 0.4516, Adjusted R-squared: 0.4514
## F-statistic: 2557 on 1 and 3105 DF, p-value: < 2.2e-16
plot_obesityUninsured status percentage
plot_uninsured<-ggplot(test_counties_inner, aes(x = ACS_PCT_UNINSURED, y = CHR_PCT_DIABETES)) +
geom_point(color = "sienna4") + ggtitle("Uninsured pct r/t to diabetes")
summary((lm(CHR_PCT_DIABETES ~ ACS_PCT_UNINSURED, data = test_counties_inner)))##
## Call:
## lm(formula = CHR_PCT_DIABETES ~ ACS_PCT_UNINSURED, data = test_counties_inner)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.9911 -1.7029 -0.0521 1.6202 8.8322
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.077338 0.122469 82.28 <2e-16 ***
## ACS_PCT_UNINSURED 0.118761 0.008552 13.89 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.51 on 3105 degrees of freedom
## Multiple R-squared: 0.05847, Adjusted R-squared: 0.05817
## F-statistic: 192.8 on 1 and 3105 DF, p-value: < 2.2e-16
plot_uninsured
Of these individual linear models, as shown above, the variable with the
strongest predictive relationship with diabetes prevalence was obesity,
which has the greatest magnitude coefficient at 0.38, a significant
relationship with diabetes prevalence, and an adjusted R2 value of
0.45.
To see all of the above plots at a glance, I created a grid of the above plots.
plot_grid(plot_medhhincome, plot_percapincome, plot_obesity, plot_uninsured, labels = "AUTO")## Warning: Removed 1 rows containing missing values (geom_point).
In considering a multivariable linear model to assess the relationship between wm_FARA and CHR_PCT_DIABETES, I needed to give consideration to the likely collinearity of some of the above variables. As such, for example, I limited the multivariable linear models to exclude potentially confounding variables that were likely to be collinear, such as median household income and per capita income.
In this linear model, I adjusted for obesity and uninsured status percentage.
lm_2<- lm(CHR_PCT_DIABETES ~ wm_FARA + CHR_PCT_ADULT_OBESITY + ACS_PCT_UNINSURED, data = test_counties_inner)
summary(lm_2)##
## Call:
## lm(formula = CHR_PCT_DIABETES ~ wm_FARA + CHR_PCT_ADULT_OBESITY +
## ACS_PCT_UNINSURED, data = test_counties_inner)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.9650 -1.1821 -0.0536 1.1685 7.2007
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.108139 0.254901 -0.424 0.671
## wm_FARA -0.020168 0.001378 -14.636 <2e-16 ***
## CHR_PCT_ADULT_OBESITY 0.351733 0.007152 49.182 <2e-16 ***
## ACS_PCT_UNINSURED 0.094156 0.006114 15.400 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.789 on 3103 degrees of freedom
## Multiple R-squared: 0.5217, Adjusted R-squared: 0.5213
## F-statistic: 1128 on 3 and 3103 DF, p-value: < 2.2e-16
In this linear model, I adjusted for obesity and median household income
lm_3<- lm(CHR_PCT_DIABETES ~ wm_FARA + CHR_PCT_ADULT_OBESITY + ACS_MEDIAN_HH_INC, data = test_counties_inner)
summary(lm_3)##
## Call:
## lm(formula = CHR_PCT_DIABETES ~ wm_FARA + CHR_PCT_ADULT_OBESITY +
## ACS_MEDIAN_HH_INC, data = test_counties_inner)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1975 -1.1735 -0.0481 1.1394 5.7481
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.079e+00 3.253e-01 18.68 <2e-16 ***
## wm_FARA -1.510e-02 1.341e-03 -11.25 <2e-16 ***
## CHR_PCT_ADULT_OBESITY 2.883e-01 7.499e-03 38.44 <2e-16 ***
## ACS_MEDIAN_HH_INC -6.627e-05 2.866e-06 -23.12 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.715 on 3102 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.5608, Adjusted R-squared: 0.5604
## F-statistic: 1320 on 3 and 3102 DF, p-value: < 2.2e-16
In these adjusted, multivariable linear models, the small, negative, and statistically significant relationship between wm_FARA and CHR_PCT_DIABETES remained. The other variables assessed also had small but significant relationships with CHR_PCT_DIABETES.
This project had several limitations.
There was not an evident way to capture purely type 2 diabetes data from available surveys. Interestingly, the CHR_PCT_DIABETES variable counts any diagnosis of diabetes, whether type 1, type 2, or other. In this project, the diabetes of interest is type 2 diabetes given that type 2 diabetes is associated with modifiable lifestyle factors. Still, given that 90-95% of all U.S. diabetes cases are type 2, this project is still capable of providing a reasonably strong estimate of any association between low access and type 2 diabetes prevalence.(9)
Given limitations in dataset availability, I was only able to do this analysis for the most recent year that both low access and diabetes data were available, 2015; ideally I would have had more recent data and been able to inspect changes over several years.
The exclusion of non-continental United States counties and territories should be noted, particularly in considering the summary statistics. Given that these areas external to the continental United States were not analyzed, it is possible that these reported statistics could be significantly different if one were to use data that included the United States (including territories) as a whole.
On a similar note, 1625 tracts were excluded from the analysis due to missing FARA_PCT_LA_HALF_10_POP data. It is unclear why these tracts had missing data. It is also unknown whether, had data been available, these tracts would have had a distribution of low access values similar to counties for which data was available (with similar mean, median) or perhaps have skewed toward lower or higher food access.
While I have used a linear regression model for this descriptive analysis of two continuous variables, in reality, a linear regression model is not optimal for outcomes measured as percentages, as these are bounded by 0 at the low end and 100 at the high end. It is possible that a generalized linear model may be a better fit for this type of data but more investigation would be required to confirm this as the most appropriate model. This project did not attempt to perform an exhaustive analysis of all the potentially confounding variables. For example, average or median age by county might have been a potential confounder to the relationship between wm_FARA percentage and diabetes percentage, but this particular variable was not evaluated.
Every one unit increase in the weighted mean low food access percentage was associated with a very slight but statistically significant decrease (0.01 to 0.03) in the diabetes prevalence percentage rate in both the unadjusted and adjusted linear models. Of note, the low adjusted R-squared value (ranges from 0 to 1 with 1 being ideal) of 0.09339 from the unadjusted model tells us that very little of the variance in diabetes prevalence percentage can be explained by changes in wm_FARA.
While statistical significance for this relationship was achieved, the very small magnitude of the wm_FARA coefficient likely renders this finding clinically insignificant; that is, there is effectively no meaningful relationship between wm_FARA percentage and the percentage prevalence of diabetes.
In multivariable linear models, obesity, uninsured status, and median household income also had very small but statistically significant associations with the prevalence of diabetes, positive for obesity and negative for uninsured status and median household income.
Still, while perhaps there is no meaningful relationship between worsening food access and diabetes prevalence in continental United States counties in 2015 as identified in this project, improving food security in the United States as a general matter remains no less important for a host of other reasons. Indeed, professional societies, including the American College of Physicians, have recently recognized the multiple health and societal consequences of food insecurity and called on governments to do more to address this pressing and widespread problem.(10)
Healthy People 2030. United States Department of Health and Human Services. Available at: https://health.gov/healthypeople/priority-areas/social-determinants-health. Accessed 12/6/22.
The cost of diabetes. American Diabetes Association. Available at: https://diabetes.org/about-us/statistics/cost-diabetes#:~:text=People%20with%20diagnosed%20diabetes%20incur,in%20the%20absence%20of%20diabetes. Accessed 12/6/22.
Tuomilehto J, Lindström J, Eriksson JG, et al. Prevention of type 2 diabetes mellitus by changes in lifestyle among subjects with impaired glucose tolerance. N Engl J Med. 2001;344(18):1343-1350. doi:10.1056/NEJM200105033441801
Tello, M. Healthy lifestyle can prevent diabetes (and even reverse it). Harvard Health Blog. 2018; September 6. Available at: https://www.health.harvard.edu/blog/healthy-lifestyle-can-prevent-diabetes-and-even-reverse-it-2018090514698
Food Security in the U.S.: Measurement. Economic Research Service, United States Department of Agriculture. https://www.ers.usda.gov/topics/food-nutrition-assistance/food-security-in-the-u-s/measurement/#insecurity. Accessed 12/6/22.
Mapping Food Deserts in the United States. Economic Research Service, United States Department of Agriculture. https://www.ers.usda.gov/amber-waves/2011/december/data-feature-mapping-food-deserts-in-the-us/. Accessed 12/6/22.
Odegaard AO, Koh WP, Yuan JM, Gross MD, Pereira MA. Western-style fast food intake and cardiometabolic risk in an Eastern country. Circulation. 2012;126(2):182-188. doi:10.1161/CIRCULATIONAHA.111.084004
State Health Facts: Health Insurance Coverage of the Total Population. Kaiser Family Foundation. Available at: https://www.kff.org/other/state-indicator/total-population/?currentTimeframe=5&selectedDistributions=medicaid--medicare&sortModel=%7B%22colId%22:%22Location%22,%22sort%22:%22asc%22%7D. Accessed 12/6/22.
Type 2 Diabetes. Centers for Disease Control and Prevention. Available at: https://www.cdc.gov/diabetes/basics/type2.html#:~:text=Healthy%20eating%20is%20your%20recipe,adults%20are%20also%20developing%20it. Accessed 12/8/22.
Serchen J, Atiq O, Hilden D; Health and Public Policy Committee of the American College of Physicians. Strengthening Food and Nutrition Security to Promote Public Health in the United States: A Position Paper From the American College of Physicians. Ann Intern Med. 2022;175(8):1170-1171. doi:10.7326/M22-0390